Evaluation of the accuracy and readability of ChatGPT-4 and Google Gemini in providing information on retinal detachment: a multicenter expert comparative study

Background: Large language models (LLMs) such as ChatGPT-4 and Google Gemini show potential for patient health education, but concerns about their accuracy require careful evaluation. This study evaluates the readability and accuracy of ChatGPT-4 and Google Gemini in answering questions about retinal detachment.
Methods: Comparative study analyzing responses from ChatGPT-4 and Google Gemini to 13 retinal detachment questions, categorized by difficulty level (D1, D2, D3). Masked responses were reviewed by ten vitreoretinal specialists and rated on correctness, errors, thematic accuracy, coherence, and overall quality grading. The analysis included the Flesch Reading Ease Score as well as word and sentence counts.
Results: Both AI tools required college-level understanding at all difficulty levels. Google Gemini was easier to understand (p = 0.03), while ChatGPT-4 provided more correct answers for the more difficult questions (p = 0.0005) with fewer serious errors. ChatGPT-4 scored highest on the most challenging questions, showing superior thematic accuracy (p = 0.003). ChatGPT-4 outperformed Google Gemini in 8 of 13 questions, with higher overall quality grades at the easiest (p = 0.03) and hardest levels (p = 0.0002), showing a lower grade as question difficulty increased.
Conclusions: ChatGPT-4 and Google Gemini effectively address queries about retinal detachment, offering mostly accurate answers with few critical errors, though patients require higher education for comprehension. The implementation of AI tools may contribute to improving medical care by providing accurate and relevant healthcare information quickly.


Background
Our clinical practice has already been transformed by the internet over the last few decades [1]. In particular, recently introduced data-driven tools such as artificial intelligence (AI) have the potential to revolutionize healthcare even more in the future [2][3][4]. This change is already underway, with more people turning to online platforms and self-diagnosis tools, such as symptom checkers [5], for healthcare information [6,7], particularly as accessing traditional face-to-face medical care becomes more challenging. However, these online tools often lack essential details to accurately assess symptom urgency [7]. Yet, there may be a shift on the horizon. Recent initiatives by the World Health Organization (WHO) seek to set standards for AI-assisted healthcare technologies, encouraging additional exploration of their potential benefits [8].
Large language models (LLMs) such as ChatGPT-4 (ChatGPT was launched for public use in November 2022, with the GPT-4 model following in March 2023) and Google Gemini (released in December 2023, with the Bard chatbot renamed Gemini in February 2024) also offer advantages in patient health education [9]. However, there are concerns that, while they can write persuasive texts, these texts can be inaccurate, distorting scientific facts and spreading misinformation [9].
Providing accurate and timely healthcare information is critical in serious eye conditions that require immediate treatment, such as acute retinal detachment (RD) or endophthalmitis. Prompt treatment is essential to reduce the risk of permanent visual deterioration, as the duration of macula-involving RD is one of the few modifiable factors for a better postoperative visual outcome [10]. Patients with acute RD often seek medical care sooner, are more conscious of the symptoms of RD [11], and also tend to be better educated [12]. This suggests that raising awareness and educating patients about the classic signs of RD could not only result in more patients seeing an ophthalmologist while their macula is still attached but could also result in a better postoperative outcome for patients.
The aim of this study is to evaluate the readability and accuracy of ChatGPT-4 and Google Gemini in responding to queries about RD.

Methods
In our comparative study, we included 13 questions frequently asked by patients on topics such as symptoms, causes of retinal detachment, surgical techniques and follow-up treatment. These questions were categorized into three difficulty levels (D1-D3) by two vitreoretinal specialists (P.S. and R.G.) (Table 1).
To obtain the most precise and specialized answer possible, ChatGPT-4 (Generative Pre-trained Transformer), developed by OpenAI (San Francisco, CA, USA), and Google Gemini (Google DeepMind, London, United Kingdom) were instructed via a prompt to assume the role of an ophthalmologist when answering: "Take the role of an ophthalmologist who answers patients' questions. Write in continuous text and exclude images and illustrations for explanation. Your task is to give a concise, specific answer that is accurate by current standards. The length of the answer should not exceed 150 words."
Each question was asked independently in a new chat window after the prompt was repeated, and the previous dialogue was deleted to avoid any possible interference of the previous questions and answers with the following ones. The evaluation criteria included the correctness, errors, thematic accuracy and coherence of the answers.

Evaluation of the answers
The answer options for each question in the online survey were organized as follows:

Correctness (single answer)
- Correct: The entire answer is correct.
- Partially incorrect: The core statement of the answer is correct, but the rest of the answer contains one or more errors.
- Incorrect: The core statement of the answer is incorrect.

Error rating if applicable (multiple answers)
- Serious errors in content: The core statement of the answer and/or the rest of the answer contains one or more serious errors in content that could have serious consequences or pose a risk to patients.
- Content errors: The core statement of the answer contains one or more content errors that do not pose a risk to patients, or the core statement of the answer is correct, but the rest of the answer contains one or more content errors that do not or only slightly change the core statement of the answer and do not pose a risk to patients.
- Formal errors: The answer contains one or more grammatical or linguistic errors, for example, but these do not affect the core message of the answer or have any other significant consequences.

Thematic accuracy (single answer)
- Applicable: The answer identifies the central concept and is thematically specific.
- Partially correct: The answer identifies the central concept, but also partially addresses an unrelated topic.
- Not applicable: The answer does not identify the central concept and/or targets an unrelated topic.

Coherence (single answer)
- Coherent: The core message of the answer is fully supported by the rest of the answer.
- Partially coherent: The core statement of the answer is essentially confirmed by the rest of the answer, but there are deviating statements or contradictions in the rest of the answer.
- Incoherent: The core statement of the answer contradicts the rest of the answer.

For the parameters correctness, thematic accuracy and coherence, only a single answer was possible; for error assessment, multiple answers or assessments of individual parts of the answer were possible due to the different error categories (content vs. formal errors). Our 13 masked questions and the corresponding answers from ChatGPT-4 and Google Gemini were sent online to ten experienced vitreoretinal specialists via the REDCap platform [13,14].
In addition to the assessment of correctness, errors, thematic accuracy and coherence, each answer was given an overall quality grading at the end. The overall quality grades were categorized based on the American GPA scoring system as follows: excellent = 4 points, good = 3 points, satisfactory = 2 points, sufficient = 1 point, bad = 0 points [15].
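As a minimal sketch of how this grading scale can be turned into a mean quality grade, the following Python snippet maps the five labels to their point values and averages them. The function name `mean_quality_grade` and the example ratings are hypothetical and serve only as an illustration; they are not study data.

```python
from statistics import mean

# Point values as stated in the text (GPA-style scale).
GRADE_POINTS = {
    "excellent": 4,
    "good": 3,
    "satisfactory": 2,
    "sufficient": 1,
    "bad": 0,
}


def mean_quality_grade(ratings):
    """Return the mean point value for a list of grade labels."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(GRADE_POINTS[label.lower()] for label in ratings)


# Hypothetical ratings from ten raters for one answer (illustrative only).
example_ratings = [
    "excellent", "good", "good", "excellent", "satisfactory",
    "good", "excellent", "good", "sufficient", "good",
]
print(mean_quality_grade(example_ratings))
```

Averaging ordinal grades in this way treats the distance between adjacent grades as equal, which is an assumption of the scale rather than a property of the ratings themselves.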

Evaluation of readability
The readability of all generated answers was analyzed with the online tool Readable (Readable.com, Horsham, United Kingdom) with regard to the number of words, number of sentences, number of words per sentence, number of long words (> 6 letters), Flesch Reading Ease Score (FRES) [16] and reading level.
The interpretation of the Flesch Reading Ease Score for evaluating the readability of a text is shown in Table 2.
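For orientation, the Flesch Reading Ease Score combines the average sentence length with the average number of syllables per word: 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies this formula in Python. The syllable counter is a crude vowel-group heuristic chosen here as an assumption, so its results will differ from those of Readable.com, whose internal method is not described in this paper.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease Score; higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat. The dog ran."
hard = "Vitrectomy necessitates postoperative positioning."
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))
```

Short sentences with short words score high, whereas dense medical terminology pushes the score down, which is the pattern Table 2 translates into school levels.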

Ethical considerations
In accordance with German legislation, approval by a medical ethics committee was not required for a study that did not include patient data. The study was performed in accordance with the ethical standards set forth in the 1964 Declaration of Helsinki.

Number of words
The mean number of words was 159 ± 20.6 for ChatGPT-4 and 155 ± 42.3 for Google Gemini.

Number of words per sentence
The mean number of words per sentence was 18.3 ± 4.2 for ChatGPT-4 and 18.6 ± 3.1 for Google Gemini.

Correctness
For difficulty levels 1 and 2, there was no significant difference between ChatGPT-4 and Google Gemini in terms of correctness (p = 0.5). The total number of correct versus partially correct answers in difficulty level 3 was 36 vs. 13 for ChatGPT-4 and 18 vs. 30 for Google Gemini (p = 0.0005) (Table 3).
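To illustrate how such a comparison of categorical counts can be tested, the following is a minimal pure-Python sketch of a two-sided Fisher's exact test applied to the difficulty level 3 counts. It assumes that the comparison is a 2 × 2 table of correct versus partially correct answers; the function name is hypothetical, and the study itself used GraphPad Prism, so the result need not match the reported p-value exactly.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: ChatGPT-4, Google Gemini; columns: correct, partially correct (D3).
p_value = fisher_exact_two_sided(36, 13, 18, 30)
print(f"p = {p_value:.4f}")
```

The small relative tolerance in the comparison guards against floating-point ties between tables that are equally likely.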

Thematic accuracy and coherence
Thematic accuracy (Table 5) and coherence (Table 6) showed higher scores for ChatGPT-4 compared to Google Gemini at difficulty level 3 (p = 0.003), whereas there was no statistically significant difference between the two LLMs at difficulty levels 1 and 2.

Discussion
Retinal detachment (RD) is a sight-threatening eye condition that requires immediate surgical intervention to prevent permanent visual impairment. Providing timely and accurate health information is critical to patient understanding and treatment outcomes [10,12]. In our study, ChatGPT-4 and Google Gemini showed promise in answering typical patient questions about RD. They delivered mostly correct and accurate responses with few serious errors. However, a college-level education is needed to comprehend the answers across various difficulty levels.
Large language models (LLMs) such as ChatGPT-4 and Google Gemini can provide health-related information to users [18]. ChatGPT-4 is an autonomous machine-learning system capable of quickly generating complex and seemingly intelligent text in a conversational style in multiple languages, including English [9,19]. Key benefits include its accessibility, cost-free usage, user-friendliness, and ongoing enhancements [9]. Consequently, it is conceivable that ChatGPT-4 could be used to help patients answer their health questions. The ability of ChatGPT-4 to respond to questions about medical examinations, including those related to ophthalmology [20,21], has been the subject of great interest and has been investigated in several studies [22,23].
Both LLMs were instructed to provide answers of up to 150 words in length. However, the mean number of words exceeded this limit, with an average of 159 ± 20.6 for ChatGPT-4 and 155 ± 42.3 for Google Gemini. Regarding the mean number of sentences, there was no significant difference between both models, with averages of 9.1 ± 1.9 for ChatGPT-4 and 8.7 ± 3.2 for Google Gemini (p = 0.72). LLMs can exceed the word limits suggested in the prompts for several reasons. They interpret prompts based on patterns from their training data, which may include longer responses. In particular, different text lengths in the training data can explain this behavior. Complex prompts may also require detailed explanations, leading to longer responses. Ambiguity in the instructions and the model's goal of providing relevant and coherent responses can also lead to exceeding the limit. Interestingly, Google Gemini required more sentences for the more difficult questions, with averages of 7.3 ± 1.5 for D1 and 12.2 ± 1.9 for D3 (p = 0.0007).
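As a rough illustration of how two means with unequal variances can be compared, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from the reported word counts. It assumes 13 answers per model (one per question), which is inferred from the study design rather than stated for this comparison, and the function name is hypothetical. A p-value would additionally require the t distribution, which is omitted here.

```python
from math import sqrt


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# Word counts: ChatGPT-4 159 ± 20.6, Google Gemini 155 ± 42.3; n = 13 is assumed.
t_stat, dof = welch_t_from_summary(159, 20.6, 13, 155, 42.3, 13)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Because the standard deviations differ by a factor of about two, the degrees of freedom fall well below the 24 that a pooled-variance t-test would use, which is the situation Welch's test is designed for.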
There was no difference between ChatGPT-4 and Google Gemini concerning the mean number of words per sentence. It was 18.3 ± 4.2 for ChatGPT-4 and 18.6 ± 3.1 for Google Gemini on average. The mean number of long words (defined as those with more than 6 letters) was 34.3 ± 4.5 for ChatGPT-4 and 29.7 ± 7.0 for Google Gemini (p = 0.76). The mean difference in the number of long words was significant between both AI tools for D1, with ChatGPT-4 exhibiting a higher count by 6.7 words on average (p = 0.04).
In terms of correctness, for the D1 and D2 questions, there was no significant difference between ChatGPT-4 and Google Gemini (p = 0.5). For D3, the total number of correct versus partially correct answers was 36 vs. 13 for ChatGPT-4 and 18 vs. 30 for Google Gemini (p = 0.0005). However, it is important to note that opinions on specific retinal disease treatments may vary, even among retinal specialists, and thus may affect the analysis of correctness. The number of serious errors was low overall, but higher at all difficulty levels for Google Gemini compared to ChatGPT-4 (D1: 1 vs. 0; D2: 4 vs. 2; D3: 4 vs. 1). In terms of thematic accuracy and coherence, ChatGPT-4 showed higher scores compared to Google Gemini at the high difficulty level (p = 0.003), whereas there was no statistically significant difference between the two LLMs at the low and medium difficulty levels.
Public health professionals should pay attention to online health-seeking behaviors, weighing potential benefits, addressing quality concerns, and outlining criteria for evaluation of online health information [24].
More than one-third of adults in the United States routinely use the internet for self-diagnosis, for both non-urgent and urgent symptoms [6,7]. Patients search for information via search engines like Google or Yahoo, as well as on health websites. This can help individuals to gain a deeper understanding of medical conditions alongside professional healthcare advice [25]. However, the popular symptom-related websites of the major search engines often lack most of the information needed to make a decision about whether a particular symptom requires immediate medical attention [7].
Misdiagnosis by physicians occurs in approximately 5% of outpatients [26]. In one study, a total of 118 physicians in the US correctly diagnosed 55.3% of easier and 5.8% of more difficult cases (p < 0.001) [27]. When asked about the accuracy of the initial diagnosis received via a symptom checker, 41% of patients said that a doctor had confirmed their diagnosis and 35% said that they had not seen a doctor for a professional assessment [6]. An evaluation of 23 known symptom checker apps found that an appropriate categorization recommendation was made in 80% of emergencies, a rate comparable to doctors in training and nurses in training [27]. An AI system known as Babylon AI, which is used in Africa for triage and diagnostic purposes, has shown that it is able to recognize the disease presented in a clinical case with an accuracy comparable to that of human doctors [28].
Importantly, ChatGPT-4, like other LLMs, can generate persuasive and subtle [29] but often inaccurate text, sometimes referred to as a 'hallucination' [30], leading to the distortion of scientific facts and the spread of misinformation [9]. The content generated by LLMs therefore needs to be reviewed [29]. Future discussion should focus on the how rather than the if of introducing this technology [19].
Our study has certain limitations. We only used the two best-known LLMs to assess the questions. Further validation with multiple LLMs is needed. We only included the most common questions asked by patients, but this may not fully reflect the complexity of patient education. In addition, treatment recommendations may also vary between different ophthalmologists. Human-generated responses may also generate controversial opinions and should be further investigated in subsequent studies. In addition, the study is limited to the English language, which may not take into account the different levels of education and understanding of patients in other languages. We also did not address potential accessibility issues, such as visual impairment, that may hinder access to AI-based tools. In addition, the instructions were specific to the LLMs, which may not fully reflect how patients would utilize such technology.

Conclusions
To summarize, ChatGPT-4 and Google Gemini showed promise in answering questions about retinal detachment, providing mostly correct answers with few critical errors, even though patients need higher education with good reading comprehension to understand them. The use of AI tools may help to improve medical care by providing accurate and relevant health information quickly.
Based on the results of our study, LLMs show promise but are not yet suitable as a sole resource for patient education due to the risk of critical errors.We would suggest using these AI tools as supplementary rather than primary sources of information until further improvements are made to minimize errors and improve accessibility for a wider patient population.

Fig. 1
Flesch Reading Ease Score (FRES) for ChatGPT-4 and Google Gemini overall and for all difficulty levels (D1, D2, D3). The bars represent the mean FRES values, and the whiskers indicate the standard deviation (SD)

Fig. 2
Quality grading in relation to the difficulty level (D1, D2, D3) for ChatGPT-4 and Google Gemini. The bars represent the mean quality grading values, and the whiskers indicate the standard deviation (SD)

Table 1
All 13 questions sorted by difficulty level
Q9 What are the treatment options for retinal detachment?
Q10 How exactly is a vitrectomy performed to treat a retinal detachment?
Q11 Which tamponades are used in vitrectomy for retinal detachment?
Q12 How do gas tamponades differ from silicone oil tamponades in retinal surgery?
Q13 What needs to be considered during postoperative care after vitrectomy?

Table 2
The table shows the Flesch Reading Ease Score (FRES) with the corresponding school level and a description of the reading difficulty level [17]

Statistical analysis was performed using GraphPad Prism 10, version 10.2.2 (341) for Mac (GraphPad Software, San Diego, USA). Categorical variables were presented as absolute and relative frequencies. Mean and standard deviation were computed for approximately normally distributed continuous variables, and median and interquartile range otherwise. Data normality was evaluated using the Shapiro-Wilk test. Welch's t-test was used to evaluate the difference in means between the two large language models. Fisher's exact test was used to evaluate the association between categorical variables. Non-normally distributed continuous variables were compared using the Mann-Whitney test. For multiple comparisons, the Brown-Forsythe and Welch ANOVA test or the non-parametric Kruskal-Wallis test with post hoc Dunn's test and correction for multiple testing were used. All statistical tests were two-sided, and a p-value < 0.05 was considered statistically significant.

Table 3
Correctness -number of correct, partially correct and incorrect answers for all 13 questions and difficulty levels

Table 4
Errors -number of serious errors, content and formal errors for all 13 questions and difficulty levels

Table 5
Thematic accuracy -number of applicable, partially applicable and not applicable answers for all 13 questions and difficulty levels

Table 6
Coherence -number of coherent, partially coherent and incoherent answers for all 13 questions and difficulty levels